MISC

# regarding 'readthedown' theme
# https://cran.r-project.org/web/packages/rmdformats/vignettes/introduction.html

Introduction

Check out this Kaggle

This data has been gathered at two solar power plants in India over a 34 day period. It has two pairs of files - each pair has one power generation dataset and one sensor readings dataset. The power generation datasets are gathered at the inverter level - each inverter has multiple lines of solar panels attached to it. The sensor data is gathered at a plant level - single array of sensors optimally placed at the plant.

There are a few areas of concern at the solar power plant -

  1. Can we predict the power generation for next couple of days? - this allows for better grid management
  2. Can we identify the need for panel cleaning/maintenance?
  3. Can we identify faulty or sub-optimally performing equipment?

Data Dictionary for Power Generation data sets

  1. AC_POWER : Amount of AC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
  2. AC_ : Amount of DC power generated by the inverter (source_key) in this 15 minute interval. Units - kW.
  3. DAILY_YIELD : Daily yield is a cumulative sum of power generated on that day, till that point in time.
  4. DATE_TIME : Date and time for each observation. Observations recorded at 15 minute intervals.
  5. PLANT_ID : Plant ID - this will be common for the entire file.
  6. SOURCE_KEY : Source key in this file stands for the inverter id.
  7. TOTAL_YIELD : This is the total yield for the inverter till that point in time.

Data Dictionary for Sensor Reading data sets

  1. IRRADIATION: Amount of irradiation for the 15 minute interval.
  2. DATE_TIME: Date and time for each observation. Observations recorded at 15 minute intervals.
  3. PALNT_ID: Plant ID - this will be common for the entire file.
  4. SOURCE_KEY: Stands for the sensor panel id. This will be common for the entire file because there’s only one sensor panel for the plant.
  5. MODULE_TEMPERATURE: There’s a module (solar panel) attached to the sensor panel. This is the temperature reading for that module.
  6. AMBIENT_TEMPERATURE: This is the ambient temperature at the plant.

Power Generation data set: Plant 1

Get and Split Data

p1.gd = read_csv('Plant_1_Generation_Data.csv') %>%
  slice_sample(prop = 0.10) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(where(is.factor),where(is.character),where(is.numeric)) #sort cols by data type

#OlsonNames()
#https://stackoverflow.com/questions/41479008/what-is-the-correct-tz-database-time-zone-for-india

p1.gd = p1.gd %>% mutate(
  date_time = as.POSIXct(strptime(p1.gd$date_time, "%d-%m-%Y %H:%M"), tz = 'Asia/Kolkata'),
  source_key = factor(p1.gd$source_key),
  source_key = factor(p1.gd$source_key)
) %>% rename(inverter = source_key)

p1.gd$plant_id = NULL

glimpse structure, and sample rows

p1.gd %>% slice_sample(n = 10) %>% DT::datatable()
p1.gd %>% glimpse()
## Rows: 6,877
## Columns: 6
## $ date_time   <dttm> 2020-06-10 13:45:00, 2020-05-30 19:15:00, 2020-05-15 0...
## $ inverter    <fct> pkci93gMrogZuBj, zVJPv84UY57bAof, iCRJl6heRkivqQ3, adLQ...
## $ ac_power    <dbl> 872.11429, 0.00000, 517.60000, 367.97500, 169.15714, 0....
## $ daily_yield <dbl> 4922.42857, 7290.00000, 941.00000, 276.00000, 73.71429,...
## $ dc_power    <dbl> 8915.7143, 0.0000, 5278.8571, 3748.0000, 1728.7143, 0.0...
## $ total_yield <dbl> 7367638, 7235977, 7178933, 6505362, 7020586, 6649525, 7...

check missing values

p1.gd %>% miss_var_summary()
## # A tibble: 6 x 3
##   variable    n_miss pct_miss
##   <chr>        <int>    <dbl>
## 1 date_time        0        0
## 2 inverter         0        0
## 3 ac_power         0        0
## 4 daily_yield      0        0
## 5 dc_power         0        0
## 6 total_yield      0        0

EDA: Factor Vars

counts each factor’s unique levels

sapply(p1.gd %>% select(where(is.factor)), n_unique) %>% as.data.frame()
##           .
## inverter 22

reference: names of unique levels

sapply(p1.gd %>% select(where(is.factor)), unique) %>% as.data.frame() %>% arrange()
##           inverter
## 1  pkci93gMrogZuBj
## 2  zVJPv84UY57bAof
## 3  iCRJl6heRkivqQ3
## 4  adLQvlD726eNBSB
## 5  3PZuoBAID5Wc2HD
## 6  ZnxXDlPa8U1GXgE
## 7  uHbuxQJl8lW7ozc
## 8  sjndEbLyjtCKgGv
## 9  1IF53ai7Xc0U56Y
## 10 rGa61gmuvPhdLxV
## 11 ZoEaEvLYb1n2sOq
## 12 WRmjgnKYAwPKWDb
## 13 7JYdWkrLSPkdwr4
## 14 ih0vzX44oOqAx2f
## 15 z9Y9gH1T5YWrNuG
## 16 wCURE6d3bPkepu2
## 17 zBIq5rxdHJRwDNY
## 18 bvBOhCH3iADSZry
## 19 VHMLBKoKgIrUVDU
## 20 YxYtjZvoooNbGkE
## 21 1BY6WEcLGh8j5v7
## 22 McdE0feGgRqW7Ca

viz: distribution of level counts

jpal = colorRampPalette(brewer.pal(8,'Dark2'))(22)

p1.gd %>% count(inverter) %>% plot_ly(y = ~fct_reorder(inverter,n), x = ~n, color = ~inverter, colors = jpal) %>% add_bars(hoverinfo = 'text', text = ~n) %>% hide_legend() %>% layout(
    title = 'Source Key Counts',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    ) 
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

Sanity Check - PASS - Pretty much a uniform distribution

EDA: Numeric Vars

viz bivariate numeric distribution

DataExplorer::plot_boxplot(p1.gd %>% select(where(is.numeric)), by = 'daily_yield')

DataExplorer::plot_boxplot(p1.gd %>% select(where(is.numeric)), by = 'total_yield')

viz: numeric univariate distributions

names.numeric = p1.gd %>% select(where(is.numeric)) %>% names

p1.gd %>% dlookr::plot_normality(
  names.numeric[1],
  names.numeric[2],
  names.numeric[3],
  names.numeric[4]
  )

Seems to be many outliers, especially 0s, in several features. Indicative of faulty inverters? Plot by inverter

viz: numeric univariate distributions

DataExplorer::plot_density(p1.gd %>% select(where(is.numeric)))

viz: distributions by ‘inverter’ factor

p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = total_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~total_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Total Yield by Inverter')
p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = daily_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~daily_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Daily Yield by Inverter')
p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = ac_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~ac_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of AC Power by Inverter')
p1.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = dc_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~dc_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of DC Power by Inverter')

correlations: viz

p1.gd %>% dlookr::plot_correlate()

since dc and ac power are just perfectly convertible (like fahrenheit and celsius), they have a perfect correlation

EDA: Time Series Viz

Anomoly Plot

library(scales)
library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

p1.gd.anomalize = p1.gd %>% arrange(date_time) %>% 
  mutate(inverter = fct_reorder(inverter, -daily_yield)) %>% 
  group_by(inverter) %>%
  time_decompose(daily_yield, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose()

ggplotly(
  p1.gd.anomalize %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'daily yield')
  ) %>% layout(showlegend = FALSE)

Power Generation data set: Plant 2

Get and Split Data

p2.gd = read_csv('Plant_2_Generation_Data.csv') %>%
  slice_sample(prop = 0.10) %>% #!!<NOTE>temp, working with a sample of datset for speed purposes
  clean_names() %>%  #lowercase
  select(sort(tidyselect::peek_vars())) %>% #sort cols alphabetically
  select(where(is.POSIXct), where(is.factor),where(is.character),where(is.numeric)) #sort cols by data type

#OlsonNames()
#https://stackoverflow.com/questions/41479008/what-is-the-correct-tz-database-time-zone-for-india

p2.gd = p2.gd %>% mutate(
  source_key = factor(p2.gd$source_key),
  source_key = factor(p2.gd$source_key)
) %>% rename(inverter = source_key)

p2.gd$plant_id = NULL

glimpse structure, and sample rows

p2.gd %>% slice_sample(n = 10) %>% DT::datatable()
p2.gd %>% glimpse()
## Rows: 6,769
## Columns: 6
## $ date_time   <dttm> 2020-05-26 10:00:00, 2020-05-19 23:00:00, 2020-06-16 1...
## $ inverter    <fct> LlT2YUhhzqhg5Sw, oZZkBaNadn6DNKz, LYwnQax7tkwH5Cb, V94E...
## $ ac_power    <dbl> 1032.61333, 0.00000, 0.00000, 289.83333, 62.58667, 0.00...
## $ daily_yield <dbl> 2028.13333, 1999.00000, 5208.00000, 240.80000, 2693.733...
## $ dc_power    <dbl> 1056.38667, 0.00000, 0.00000, 295.21333, 64.62667, 0.00...
## $ total_yield <dbl> 282665635, 1708110791, 1795111912, 1412283025, 16601787...

check missing values

p2.gd %>% miss_var_summary()
## # A tibble: 6 x 3
##   variable    n_miss pct_miss
##   <chr>        <int>    <dbl>
## 1 date_time        0        0
## 2 inverter         0        0
## 3 ac_power         0        0
## 4 daily_yield      0        0
## 5 dc_power         0        0
## 6 total_yield      0        0

EDA: Factor Vars

counts each factor’s unique levels

sapply(p2.gd %>% select(where(is.factor)), n_unique) %>% as.data.frame()
##           .
## inverter 22

reference: names of unique levels

sapply(p2.gd %>% select(where(is.factor)), unique) %>% as.data.frame() %>% arrange()
##           inverter
## 1  LlT2YUhhzqhg5Sw
## 2  oZZkBaNadn6DNKz
## 3  LYwnQax7tkwH5Cb
## 4  V94E5Ben1TlhnDV
## 5  oZ35aAeoifZaQzV
## 6  rrq4fwE8jgrTyWY
## 7  xoJJ8DcxJEcupym
## 8  NgDl19wMapZy17u
## 9  Quc1TzYxW2pYoWX
## 10 WcxssY2VbP4hApt
## 11 IQ2d7wF4YD8zU1Q
## 12 Et9kgGMDl729KT4
## 13 Qf4GUc1pJu5T6c6
## 14 mqwcsP2rE7J0TFp
## 15 PeE6FRyGXUgsRhN
## 16 4UPUqMRk7TRMgml
## 17 9kRcWv60rDACzjR
## 18 Mx2yZCDsyf6DPfv
## 19 81aHJ1q11NBPMrL
## 20 vOuJvMaM2sgwLmb
## 21 q49J1IKaHRwDQnt
## 22 xMbIugepa2P7lBB

viz: distribution of level counts

jpal = colorRampPalette(brewer.pal(8,'Dark2'))(22)

p2.gd %>% count(inverter) %>% plot_ly(y = ~fct_reorder(inverter,n), x = ~n, color = ~inverter, colors = jpal) %>% add_bars(hoverinfo = 'text', text = ~n) %>% hide_legend() %>% layout(
    title = 'Source Key Counts',
    xaxis = list(title = ''),
    yaxis = list(title = '')
    ) 

Sanity Check - PASS - Pretty much a uniform distribution

EDA: Numeric Vars

viz bivariate numeric distribution

DataExplorer::plot_boxplot(p2.gd %>% select(where(is.numeric)), by = 'daily_yield')

DataExplorer::plot_boxplot(p2.gd %>% select(where(is.numeric)), by = 'total_yield')

viz: numeric univariate distributions

names.numeric = p2.gd %>% select(where(is.numeric)) %>% names

p2.gd %>% dlookr::plot_normality(
  names.numeric[1],
  names.numeric[2],
  names.numeric[3],
  names.numeric[4]
  )

Seems to be many outliers, especially 0s, in several features. Indicative of faulty inverters? Plot by inverter

viz: numeric univariate distributions

DataExplorer::plot_density(p2.gd %>% select(where(is.numeric)))

viz: distributions by ‘inverter’ factor

p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = total_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~total_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Total Yield by Inverter')
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = daily_yield, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~daily_yield, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of Daily Yield by Inverter')
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = ac_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~ac_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of AC Power by Inverter')
p2.gd %>% mutate(inverter = fct_reorder(.f = inverter, .x = dc_power, .fun = median, .desc = TRUE)) %>%
  plot_ly(y = ~inverter, x = ~dc_power, color = ~inverter, colors = jpal) %>% add_boxplot()%>%
  hide_legend() %>% layout(xaxis = list(title = ''), yaxis = list(title = ''), title = 'Distribution of DC Power by Inverter')

correlations: viz

p2.gd %>% dlookr::plot_correlate()

since dc and ac power are just perfectly convertible (like fahrenheit and celsius), they have a perfect correlation

EDA: Time Series Viz

Anomoly Plot

library(scales)
library(anomalize)
# anomalize(data, target, method = c("iqr", "gesd"), alpha = 0.05, max_anoms = 0.2, verbose = FALSE)

# alpha: Controls the width of the "normal" range. Lower values are more conservative while higher values are less prone to incorrectly classifying "normal" observations.
# max_anoms: The maximum percent of anomalies permitted to be identified.

p2.gd.anomalize = p2.gd %>% arrange(date_time) %>% 
  mutate(inverter = fct_reorder(inverter, -daily_yield)) %>% 
  group_by(inverter) %>%
  time_decompose(daily_yield, method = 'twitter', merge = TRUE) %>%
  anomalize(remainder, alpha = 0.05, method = 'gesd') %>%
  time_recompose()

ggplotly(
  p2.gd.anomalize %>%
    plot_anomalies(
      ncol = 2,
      alpha_dots = 0.5,
      alpha_circles = 0.5,
      size_circles = 2,
      time_recomposed = TRUE,
      alpha_ribbon = 0.05
      ) + scale_y_continuous(labels = comma) +
    labs(x = '', y = 'daily yield')
  ) %>% layout(showlegend = FALSE)